Uniformize kwargs for Idefics/2 processors #32568

yonigozlan · 2024-08-09T16:10:30Z

What does this PR do?

Adds uniformized processors kwargs following #31911 for the following models:

Idefics
Idefics2

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@molbap @zucchini-nlp @amyeroberts

HuggingFaceDocBuilderDev · 2024-08-09T16:29:54Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

yonigozlan · 2024-08-09T16:50:32Z

src/transformers/models/idefics/processing_idefics.py

+        # for BC
+        if text is None:
+            # if the user didn't specify text=text in the call, we assume they want to use the old behavior
+            # with text (previously prompts) as a first argument
+            warnings.warn(
+                "The use of `text` as the first argument will be deprecated in the future. `images` is now the first argument."
+                "The first given argument will be considered as `prompts` in the old behavior.",
+            )
+            text = images
+            images = None
+        if images is None:
+            # assuming the user wants to use the old behavior with prompts as the only argument
+            prompts = text
+        elif text is not None:
+            # Assuming image-text-to-text behavior:
+            # Check if batched images are provided
+            if not isinstance(images, (list, tuple)):
+                images = [images]
+            if isinstance(text, str):
+                # one prompt for all images instead of one prompt per image
+                text = [text] * len(images)
+            # Check if batched text is provided
+            if isinstance(text, (list, tuple)) and len(text) != len(images):
+                raise ValueError(
+                    "When using the image-text-to-text behavior, the number of prompts should be the same as the number of images."
+                )
+            # Check that only text is present in the prompts
+            if not all(isinstance(i, str) for i in text):
+                raise ValueError("When using the image-text-to-text behavior, the prompts should only contain text.")
+            prompts = list(zip(images, text))


A lot of logic is needed for backward compatibility, as Idefics used to take only prompts, in which text and image inputs were interleaved. The added logic preserves support for this kind of input (with prompts replaced by the text arg), while adding support for the usual separate text and images inputs, as in other image-text-to-text models. This will also be useful for supporting Idefics in the image-text-to-text pipeline.
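The branching in the hunk above can be read as a standalone function. The following is only a sketch of the logic as first proposed in this hunk, not the code that was eventually merged; the function name `normalize_idefics_inputs` is hypothetical, and plain strings stand in for images.

```python
import warnings


def normalize_idefics_inputs(images=None, text=None):
    """Hypothetical helper: build legacy-style prompts from either calling convention."""
    if text is None:
        # Legacy call: the first positional argument held the prompts.
        warnings.warn(
            "Passing prompts as the first argument is the legacy behavior; "
            "`images` is now the first argument."
        )
        text, images = images, None
    if images is None:
        # Prompts-only input, possibly with images interleaved inside each prompt.
        return text
    # Image-text-to-text input: wrap a single image into a batch of one.
    if not isinstance(images, (list, tuple)):
        images = [images]
    if isinstance(text, str):
        # As in the hunk above: one prompt is repeated for every image.
        text = [text] * len(images)
    if len(text) != len(images):
        raise ValueError("The number of prompts should match the number of images.")
    if not all(isinstance(t, str) for t in text):
        raise ValueError("The prompts should only contain text.")
    # Each prompt becomes an (image, text) pair.
    return list(zip(images, text))
```

Calling `normalize_idefics_inputs(["a prompt"])` takes the legacy path and emits the warning, while `normalize_idefics_inputs(images=[img1, img2], text="Describe this")` pairs the same prompt with each image.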

zucchini-nlp

Also looks good to me. Just want to clarify what will be the new format for Idefics to make the pipeline happy. Maybe we can add a test for that new format :)

zucchini-nlp · 2024-08-12T05:41:27Z

src/transformers/models/idefics/processing_idefics.py

+            if isinstance(text, str):
+                # one prompt for all images instead of one prompt per image
+                text = [text] * len(images)


I guess this code block is for new processing behavior when users pass images and text.

I'm not sure it's a good idea to repeat the text several times. Suppose a user has one prompt with interleaved images and text: we would then replicate the prompt several times and cause an error in the downstream modeling code. For example:

processor(text=["User: What do you see here? Assistant: a cat. User: what about this image?"], images=[image1, image2])

Yes, that's a good point, although interleaved images and text are not really supported when providing both images and text for Idefics, as there is no way to indicate where to put the images in the prompt. Maybe I should add a warning here instead of automatically duplicating the prompts?

Ah I see now, indeed Idefics is a bit peculiar.

Yes, interleaving like that is not supported, but providing more than one image per prompt, as in a multi-turn conversation, is okay, as shown in the docstring of the call method. We should then expect users to pass as many images as prompts, and they would have to wrap the images in a nested list if there is more than one per prompt.

I think we can even raise an error, as we cannot know for sure what the user expects with these inputs. The error should explain what kind of input we want and let the user fix it; otherwise, users who never read warnings might start complaining in the issues :)

Added support for multiple images per prompt, and this error to make it clearer what input format we expect when using the image-text-to-text format:

transformers/src/transformers/models/idefics/processing_idefics.py

Lines 353 to 358 in 8b171a7

    # Check if batched images and text are in the correct format
    if isinstance(text, (list, tuple)) and len(text) != len(images):
        raise ValueError(
            "When providing both images and text arguments, the number of text prompts should be the same as the number of images."
            "If you want to have several images per prompt, images should be nested as such: images=[[img1, img2], [img3, img4], ...] for text=[prompt1, prompt2, ...]."
        )
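To make the expected input format concrete, here is a minimal sketch of how nested images could be paired with prompts under the rule quoted above. The function name `pair_images_with_prompts` is hypothetical, the handling of a bare string prompt is an assumption, and plain strings stand in for images; the actual processor code may differ.

```python
def pair_images_with_prompts(images, text):
    """Hypothetical helper: pair each prompt with its own list of images."""
    # Assumption: a bare string prompt is treated as a batch of one.
    if isinstance(text, str):
        text = [text]
    if not isinstance(images, (list, tuple)):
        images = [images]
    # Same check as the quoted snippet: one entry in `images` per prompt.
    if len(text) != len(images):
        raise ValueError(
            "The number of text prompts should be the same as the number of images. "
            "Nest images as images=[[img1, img2], [img3, img4], ...] "
            "for text=[prompt1, prompt2, ...]."
        )
    pairs = []
    for prompt_images, prompt in zip(images, text):
        # A single image for a prompt is wrapped so every entry is a list.
        if not isinstance(prompt_images, (list, tuple)):
            prompt_images = [prompt_images]
        pairs.append((list(prompt_images), prompt))
    return pairs
```

With this rule, the ambiguous call from the review, `text=[prompt]` with `images=[image1, image2]`, raises an error, while `images=[[image1, image2]]` attaches both images to the single prompt.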

zucchini-nlp · 2024-08-12T06:00:31Z

tests/models/idefics/test_processor_idefics.py

+    def test_tokenizer_defaults_preserved_by_kwargs(self):
+        if "image_processor" not in self.processor_class.attributes:


imo we don't have to overwrite all tests for idefics; they seem quite similar to the general tests from the mixin except for padding="max_length". We can, and maybe should, specify padding="max_length" in the mixin tests, because we can't assume all tokenizers default to "max_length" padding.
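The point about explicit padding can be illustrated with a toy example. This is only a sketch: `DummyTokenizer` is a hypothetical stand-in and not a transformers class, and its "token ids" are just word lengths. It shows why a length assertion only holds for every tokenizer when the test passes padding="max_length" itself.

```python
class DummyTokenizer:
    """Hypothetical stand-in tokenizer that pads only when explicitly asked to."""

    def __init__(self, max_length=8, pad_id=0):
        self.max_length = max_length
        self.pad_id = pad_id

    def __call__(self, text, padding=False, max_length=None):
        # Toy "token ids": one integer per whitespace-separated word.
        ids = [len(word) for word in text.split()]
        limit = max_length or self.max_length
        if padding == "max_length":
            # Pad up to the limit; longer inputs are left unchanged in this toy.
            ids = ids + [self.pad_id] * (limit - len(ids))
        return {"input_ids": ids}


tokenizer = DummyTokenizer(max_length=8)

# Relying on the default: the output length depends on the tokenizer's own default.
unpadded = tokenizer("lower newer")

# Passing the padding strategy explicitly: the length is the same for any default.
padded = tokenizer("lower newer", padding="max_length")
```

Here `len(padded["input_ids"])` equals the configured `max_length`, whereas `len(unpadded["input_ids"])` is just the number of words.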

yonigozlan mentioned this pull request Aug 9, 2024

Uniform kwargs for processors #31911

Open

40 tasks

yonigozlan marked this pull request as ready for review August 9, 2024 16:14

yonigozlan requested review from amyeroberts, molbap and zucchini-nlp August 9, 2024 16:14

yonigozlan mentioned this pull request Aug 10, 2024

Add Idefics 3! #32473

Open

5 tasks

zucchini-nlp reviewed Aug 12, 2024

andimarafioti reviewed Aug 13, 2024

yonigozlan added 4 commits August 13, 2024 21:41

Add uniformize idefics processor kwargs and tests

a94cf09

Uniformize idefics2 processor kwargs

8af76fa

add image_processor tests idefics

e6747ff

add BC args order change idefics2 processor and update doc

1a231cb

yonigozlan mentioned this pull request Aug 14, 2024

Standardize image-text-to-text-models outputs #32471

Open

25 tasks

Add support for multiple images per prompt in image-text-to-text mode…

8b171a7

… idefics

yonigozlan force-pushed the uniformize-processors-kwargs-idefics-idefics2 branch from 9799303 to 8b171a7 Compare August 14, 2024 14:13

Fix processor input args in idefics tests

747fbe1
